DNA sequencing enables the discovery of 
new genes in high-throughput, low-cost 
experiments. Conversely, gene function is 
determined by low-throughput, high-cost 
experiments. This mismatch between
the two types of data is a major impediment
to meeting one of the major scientific
challenges of our time: the understanding
of genomes.

This mismatch in throughput is illus- 
trated by considering the progress made 
for one of the earliest sequenced genomes, 
that of Mycobacterium tuberculosis H37Rv 
(Mtb). When its genome was published 
in 1998, more than a quarter of its genes 
had no known function (Cole et al., 
1998). Our lack of knowledge about these 
approximately 1000 "conserved hypotheti- 
cal" genes in Mtb represents a serious defi- 
ciency in our understanding of its biology. 
Now, after more than a decade of progress,
our knowledge of those proteins' functions
is essentially unchanged: more than 900
genes still have no known function
(Lew et al., 2011). In contrast,
during this same period, the scientific 
community has sequenced approximately 
18,000 new genomes (Pagani et al., 2012), 
containing millions of new hypothetical 
proteins. Apparently, the vector of our 
progress has tipped decisively away from 
data interpretation and comprehension, 
and toward mere data collection. 

To address the issue of gene function 
testing and annotation for all microbes, 
we founded COMBREX (COMputational 
BRidge to Experiments), an endeavor 
aimed at accelerating the rate of gene func- 
tion validation (Anton et al., 2013). Two of 
COMBREX's more prominent initiatives 
were the creation of a comprehensive
database for protein function data
(http://combrex.bu.edu), and the deployment
of a crowdsourcing platform to catalyze 
protein function experimentation. In the 
course of these two efforts, it became 
apparent that fundamental changes in 
approaches to the problem of protein 
function determination were needed if 
there was any hope of keeping pace 
with DNA sequencing. We suggest that 
the community work together to (1) re- 
establish the connection between existing 
gene annotation and the foundational 
experimental data that supports all anno- 
tation, (2) develop experiment design 
principles to help guide the identification 
of maximally informative targets for function
validation, (3) invest in the development
of higher-throughput approaches for
the testing of protein function, and (4)
provide an expedited publication pathway 
for reporting experimental results of gene 
function, analogous to the reporting of 
newly sequenced genomes in the journal 
"Standards in Genomic Sciences." 

COMPREHENSIVE ASSESSMENT OF 
PROTEIN FUNCTION STATUS 

We recently examined protein function
status in more than 1000 completely
sequenced microbes (Anton et al., 2013).
For 3.3 million identified genes, we can
currently document experimentally determined
functions for just 0.4% of the
proteins (13,665 proteins). About 76% of
the proteins are annotated using computational
methods, and the remaining 24%
of proteins (close to 1 million) have no
known or predicted function. Thus,
a very small number of experimental
data points provides the foundation for
an enormously disproportionate number
of predicted gene function annotations.
(While the total number of experimentally
characterized proteins is unknown,
we estimate the number to be above
50,000.)
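
The scale of this imbalance can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the ratio it prints is a rough derived estimate rather than a number reported by the study.

```python
# Figures quoted in the text (Anton et al., 2013).
total_genes = 3_300_000
experimentally_characterized = 13_665
computational_fraction = 0.76  # share of proteins annotated computationally

# Share of proteins with a documented experimental function.
experimental_fraction = experimentally_characterized / total_genes

# Approximate number of computational annotations resting on each
# experimentally documented protein.
computationally_annotated = computational_fraction * total_genes
predictions_per_experiment = computationally_annotated / experimentally_characterized

print(f"experimental fraction: {experimental_fraction:.2%}")
print(f"predictions per experimental data point: {predictions_per_experiment:.0f}")
```

Under these figures, each experimentally documented protein supports on the order of a couple of hundred predicted annotations.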

An unavoidable consequence of the fact
that only a small proportion of annotations
are based directly on experiment is
that predicted functions are often based
on weak chains of inference. This can
greatly contribute to the proliferation of
incorrect annotations. When a newly
discovered gene is annotated based on
similarity to an experimentally characterized
gene, it then itself becomes a source
for future annotation. As a result, genes
annotated in the future may be annotated
based on genes that are themselves far
removed from solid experimental evidence.
Compounding the confusion, in the vast
majority of cases the original experimental
source has not been recorded or preserved.
One study estimated that for 37 protein
families and 7000 sequences, the overall
misannotation rate is roughly 40%
(Schnoes et al., 2009), yet most annotations
go unquestioned by many working
scientists.
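
To make the idea of a chain of inference concrete, the following minimal sketch counts how many annotation-transfer steps separate a gene from an experimentally characterized one. It assumes, purely for illustration, that each annotation records the single gene it was copied from; the gene names, the `annotated_from` mapping, and the function name are all hypothetical.

```python
def hops_to_experiment(gene, annotated_from, experimental):
    """Count inference steps from `gene` back to an experimental source.

    Returns None when the chain is circular or when a source was never
    recorded, so no experimental evidence can be reached.
    """
    hops = 0
    seen = set()
    current = gene
    while current not in experimental:
        if current in seen or current not in annotated_from:
            return None
        seen.add(current)
        current = annotated_from[current]
        hops += 1
    return hops


# Hypothetical example: only gene_A has been tested at the bench.
experimental = {"gene_A"}
annotated_from = {
    "gene_B": "gene_A",  # annotated directly from the experimental source
    "gene_C": "gene_B",
    "gene_D": "gene_C",  # three steps removed from any experiment
    "gene_E": "gene_X",  # source gene has no recorded provenance
}

for gene in ["gene_A", "gene_B", "gene_D", "gene_E"]:
    print(gene, hops_to_experiment(gene, annotated_from, experimental))
```

A gene such as `gene_E`, whose chain never reaches an experiment, corresponds to the case in which the original experimental source has not been preserved.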

CROWDSOURCING THE 
EXPERIMENTAL TESTING OF PROTEIN 
FUNCTION 

In the first phase of the project,
COMBREX awarded funds to 14 labs,
and 140 proteins were examined. One of
the primary criteria for funding was
prior published work using the proposed
enzyme assay. The rationale was that
experimental efficiency would be
greatest, and costs lowest, in laboratories
that already have the reagents,
equipment, and expertise necessary to
perform the experiments quickly and
accurately. Research on many of these
proteins has been successfully completed, and
results have been published (Chatterjee
et al., 2012; Clark et al., 2012; Francis
et al., 2012; Phillips et al., 2012; Rodionova
et al., 2012; Su et al., 2012; Xu et al., 2012;
Choi et al., 2013; Elkin et al., 2013), while
research on the others is still in progress.

When a protein's function is experi- 
mentally determined, it not only affects 
its own annotation, it changes the prob- 
ability that other proteins that are close 
in sequence space have a similar anno- 
tation. Thus the potential impact of the 
experiments COMBREX was able to fund 
is much larger than simply the proteins 
tested: the 140 proteins reside in Protein 
Clusters containing in total more than 
3200 proteins, which are therefore quite 
close in sequence space and likely to have 
similar functions. At a further distance 
threshold, there are over 60,000 proteins 
that have BLAST E-values less than 1e-05.
The 140 proteins have eight Pfam-defined 
domains of unknown function (DUFs), 
resulting in novel predictive insights for all 
other proteins containing these DUFs (a 
total of 1610 in the COMBREX Database). 
Finally, 37 of these 140 proteins con- 
tain a total of 28 unique Pfam-defined 
domains shared with human proteins, 
providing functional insights that may 
impact human health. 
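
As an illustration of the looser distance threshold, the sketch below filters BLAST hits by E-value. It assumes the hits are available as standard 12-column tabular BLAST output, in which the E-value is the 11th tab-separated field; the function name and the example identifiers are hypothetical and do not come from the COMBREX pipeline.

```python
def hits_below_evalue(tabular_lines, max_evalue=1e-5):
    """Yield (query, subject, evalue) for hits with E-value below the cutoff.

    Assumes the default 12-column tabular layout, where the E-value is
    the 11th tab-separated field.
    """
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue < max_evalue:
            yield fields[0], fields[1], evalue


# Made-up hits for illustration only; the identifiers are hypothetical.
example_hits = [
    "tested_1\tneighbor_A\t92.1\t310\t24\t0\t1\t310\t1\t310\t3e-120\t410",
    "tested_1\tneighbor_B\t31.5\t280\t180\t6\t5\t280\t9\t290\t2e-08\t58.2",
    "tested_1\tneighbor_C\t27.0\t100\t70\t3\t10\t105\t40\t138\t0.004\t33.1",
]

kept = list(hits_below_evalue(example_hits))
for query, subject, evalue in kept:
    print(query, subject, evalue)
```

In this toy input the first two hits pass the cutoff and the third, with an E-value of 0.004, is discarded.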

Several of the COMBREX awards went 
to labs that had participating undergraduate
students, highlighting that the types
of experiments COMBREX funds mesh
well with the interests and capabilities 
of undergraduate students eager to par- 
ticipate in original research, and with 
STEM educational goals of many science 
departments. As an example, undergradu- 
ate students at the University of Virginia 
were able to successfully determine bio- 
chemical activities and enzyme kinetics 
for three uncharacterized proteins (Elkin 
et al., 2013). COMBREX hopes to replicate
these successes as part of an educational 
component at numerous undergraduate 
institutions, in a manner analogous to the 
Small World Initiative, developed at Yale, 
which enlists undergraduates in the search 
for new antibiotics (Barral et al., 2014). 



CONNECTING ANNOTATION TO 
EXPERIMENTAL SOURCES 

When confronted today with the task of
annotating a newly discovered hypothetical
protein, a researcher can use BLAST to
quickly and robustly identify homologous
proteins. This sometimes provides clues to
potential gene function. However, just as
often, one is inundated with matches to
other hypothetical proteins that reveal
little about possible gene function and
obscure similarities to experimentally
characterized proteins.

We developed a prototype tool, named 
COMBLAST, to associate query genes with 
the various types of experimental evi- 
dence and data stored in COMBREX. 
COMBLAST returns results summarized 
in a format that concisely captures the 
functional features of similar proteins. 
COMBLAST output includes a trace to 
experimental evidence of function via 
sequence and domain similarity, to avail- 
able structural information for related 
proteins, to association with clinically rel- 
evant phenotypes such as antibiotic resis- 
tance, and other relevant information. 

The first application of COMBLAST 
was deployed in a collaboration led by D. 
Wood and S. Salzberg (Wood et al., 2012). 
We analyzed 1474 prokaryotic genome 
annotations in GenBank and identified 
25,394 potential genes that were very likely 
overlooked during the original annota- 
tions. COMBREX was able to provide sup- 
porting evidence of their protein-coding 
nature, and we were able to associate 
13,633 of the proteins with published biochemical
evidence. Providing explicit links
to documented proteins represents one 
approach for supporting annotations of 
"missing proteins" (Lane et al., 2014), until 
comprehensive proteomic surveys confirm 
their expression (Kim et al., 2014). An
efficient and user-friendly interface to
the COMBLAST software is under development;
once deployed, it will
enable any scientist to quickly reassess
the validity of any existing annotation, or
to generate hypotheses based solidly on
existing experimental evidence.

DESIGNING EXPERIMENTS WITH 
INCREASED INFORMATION CONTENT 

Because only a small number of experiments
can be performed, each attempted
experiment carries a premium, and the
amount of information likely to be derived
from any one experiment becomes an
important consideration. This "information
gain" from the experimental analysis
of a given protein depends on
the number of proteins near it in
sequence space, as well as on the distances
from that protein to previously characterized
proteins.

In the simplest sense, characterization
of a judiciously chosen protein generates
or improves predictions for many
other proteins across many genomes, while 
characterization of a protein related to few 
or no other proteins may have a much 
smaller impact. More formally, for func- 
tion prediction methods that report prob- 
abilities with their predictions (Letovsky 
and Kasif, 2003), the information gain 
from an experiment can be quantified as 
the reduction in the estimated probabil- 
ity of prediction error, summed across all 
predictions. 
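
A minimal sketch of this quantity follows, assuming each prediction carries an estimated probability of being wrong both before and after a proposed experiment. The function name, the protein identifiers, and the probabilities are hypothetical values chosen for illustration, not output from any published prediction method.

```python
def information_gain(error_before, error_after):
    """Total reduction in estimated error probability across predictions.

    Both arguments map a protein identifier to the estimated probability
    that its current predicted function is wrong.
    """
    return sum(error_before[p] - error_after[p] for p in error_before)


# Hypothetical estimates for three neighbors of a candidate protein,
# before and after the candidate is (notionally) characterized.
before = {"neighbor_1": 0.50, "neighbor_2": 0.40, "neighbor_3": 0.30}
after = {"neighbor_1": 0.10, "neighbor_2": 0.15, "neighbor_3": 0.30}

gain = information_gain(before, after)
print(f"expected information gain: {gain:.2f}")
```

Here the third neighbor contributes nothing because its estimate does not change, which mirrors the point that an experiment matters only for the predictions it actually improves.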

In COMBREX, we implemented a 
proof-of-concept prioritization scheme 
that ranked proteins for experimental test- 
ing, which roughly paralleled expected 
trends in information gain. The "ideal" 
COMBREX target is a protein close to 
many other uncharacterized proteins, and 
relatively far from any protein of known 
function, but not so far that it would 
preclude high quality predictions of the 
protein's function for the experimentalist 
to test. A second, "soft" guideline was the 
encouragement to test more than a single 
protein within a family. Typically, there is
only a marginal increase in labor to
biochemically test three similar proteins in
parallel once one has procured all the
reagents and prepared all the buffers for the
testing of a single protein, yet the information
gain can be significantly increased, as
one starts to define the boundaries of the
regions of sequence space that contain
proteins with a specific function. Put
another way, these design principles do not
provide answers; they
help experimentalists ask better questions.
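
The qualitative description of an "ideal" target can be turned into a toy ranking, sketched below. This is not the scheme implemented in COMBREX: the scoring formula, the distance window, the 0-to-1 distance scale, and all names and values are assumptions made only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    uncharacterized_neighbors: int    # uncharacterized proteins nearby
    distance_to_characterized: float  # 0 = identical, 1 = unrelated


def priority(candidate, min_distance=0.3, max_distance=0.7):
    """Score a candidate; higher means a more informative target.

    Candidates too close to a known protein add little new information,
    and candidates too far away lack a testable prediction, so both
    receive a score of zero in this toy scheme.
    """
    d = candidate.distance_to_characterized
    if not (min_distance <= d <= max_distance):
        return 0.0
    return candidate.uncharacterized_neighbors * d


def rank(candidates):
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=priority, reverse=True)


# Hypothetical candidates for illustration.
candidates = [
    Candidate("P1", uncharacterized_neighbors=500, distance_to_characterized=0.5),
    Candidate("P2", uncharacterized_neighbors=2000, distance_to_characterized=0.9),
    Candidate("P3", uncharacterized_neighbors=50, distance_to_characterized=0.1),
    Candidate("P4", uncharacterized_neighbors=300, distance_to_characterized=0.6),
]

for c in rank(candidates):
    print(c.name, priority(c))
```

In this example P2 has the most uncharacterized neighbors but scores zero, because it lies too far from any characterized protein to support a high-quality prediction for the experimentalist to test.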

DEVELOPMENT OF 
HIGH-THROUGHPUT TECHNOLOGIES 
FOR GENE FUNCTION DETERMINATION 

The functional characterization of hypo- 
thetical proteins with only remote 
sequence homology to known proteins 
can be challenging, as there may be few 
clues to guide initial experiments. Several 
groundbreaking efforts have circumvented 
this obstacle by deploying technologies 
that utilize a large diverse set of reagents, 
or cast their net over a large, complex 
pool of proteins. Yakunin and coworkers
(Kuznetsova et al., 2005; Proudfoot
et al., 2008) screen individual proteins 
for general activity using a set of reagents 
selected to be generically active (testing 
for broad functionalities, such as phos- 
phatase, dehydrogenase, protease, etc.), 
which is then followed by the use of more 
specific substrates. Cravatt and coworkers 
(Cravatt et al., 2008; Simon and Cravatt, 
2010) have pioneered a complementary 
approach, "activity-based protein profil- 
ing," enriching enzymes of a particular 
class using reagents that contain affin- 
ity labels, reactive groups and a tag for 
isolation, and then identifying proteins 
by mass spectrometry. They and others 
have applied this technique to multiple 
classes of enzymes, including hydrolases,
proteases, kinases, phosphatases,
histone deacetylases, glycosidases, and 
oxidoreductases. 

We have recently developed a workflow 
for the characterization of hypothetical 
proteins and applied it to six proteins 
from H. pylori (Choi et al., 2013). We utilized
an affinity method to generate initial
hypotheses for hypothetical proteins, and 
then confirmed reactivity using standard 
recombinant DNA technology and tradi- 
tional in vitro biochemistry. The affinity 
reagents utilize nano-particles coated with 
substrate analogs to enrich proteins from 
cell lysates of H. pylori. Isolated proteins 
were identified using mass spectrometry. 
After cloning and expression in E. coli, 
the proteins were tested for biochemical 
activities related to the molecular fragment
serving as the affinity bait. The proteins
characterized include a guanosine
triphosphate (GTP) cyclohydrolase (HP0959), an
ATPase (HP1079), an adenosine deaminase
(HP0267), a phosphodiesterase
(HP1042), and an aminopeptidase (HP1037);
new substrates were also characterized for
a peptidoglycan deacetylase (HP0310).

THE NEED FOR CONVENIENT 
PUBLICATION PATHWAYS FOR 
IMPROVED DISSEMINATION OF 
RESULTS 

We suspect that a tremendous amount 
of pertinent experimental gene function
information is lost to the community at
large because of difficulties associated with 
finding appropriate venues to dissemi- 
nate the information. The genomics com- 
munity addressed this need smartly with 
the creation of an open access journal, 
Standards in Genomic Sciences. This jour- 
nal typically publishes short, straightfor- 
ward descriptions reporting a new genome 
sequence based on a standard template. 

There is a need for a similar publica- 
tion mechanism for gene function data. It 
appears that currently, the scientific com- 
munity's publication standards generally 
dictate that a successful biochemical exper- 
iment alone does not meet the criteria 
for a minimum publishable unit. Without 
accompanying data about the gene's role 
in the biology of the organism, or obser- 
vations on associated phenotypic effects, 
biochemical results are not "enough" of 
a story. As a result, useful experimental 
information remains hidden in individual 
notebooks, lost to the wider community. 

In our opinion, there would be great 
value in a publication venue that accepted 
streamlined "biochemical reports" in a 
routine manner. The minimal data provided
would be the sequence of the gene, 
the protein production method, the bio- 
chemical assay, and an interpretation of 
the results. Similarly, simple reports on 
gene overexpression or knockouts and 
their phenotypic effects would permit 
the dissemination of meaningful func- 
tional data. Such data could be linked to 
COMBREX and other frequently accessed 
gene databases to expedite the dissemina- 
tion process by avoiding human curation 
or processing. 
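
One way such a streamlined report could be made machine-readable, so that databases can ingest it without manual curation, is sketched below. The four fields follow the minimal data listed above; the class name, field names, completeness check, and example values are hypothetical and do not describe an existing submission format.

```python
from dataclasses import dataclass, fields


@dataclass
class BiochemicalReport:
    gene_sequence: str
    protein_production_method: str
    biochemical_assay: str
    interpretation: str


def missing_fields(report):
    """Return the names of any fields left empty in the report."""
    return [
        f.name for f in fields(report)
        if not getattr(report, f.name).strip()
    ]


# Hypothetical, deliberately incomplete submission.
draft = BiochemicalReport(
    gene_sequence="ATGAAAGCGTAA",
    protein_production_method="recombinant expression in E. coli",
    biochemical_assay="",
    interpretation="",
)

print(missing_fields(draft))
```

A report that returns an empty list from `missing_fields` would carry the minimal data described above; a richer schema could add organism, assay conditions, or phenotype data in the same way.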

SUMMARY 

There needs to be a paradigm shift in the 
approach taken to determine and assign 
gene function if there is to be any hope 
of realizing the potential benefits from 
the torrent of new genome sequences. 
We advocate here for: (1) experimental 
designs that test sets of maximally infor- 
mative proteins, (2) maximal informa- 
tion extraction from every experimental 
result, with explicit traces provided to 
related proteins, (3) enhanced opportu- 
nities for collaboration among compu- 
tational and experimental researchers to 
share predictions and results, and distribute
limited resources, (4) investment
by granting agencies in the development
of high-throughput gene function testing, 
and (5) the creation of new publication 
options to report and share the results of 
experiments that are performed. 